JDsearch: A Personalized Product Search Dataset with Real Queries and Full Interactions
Recently, personalized product search has attracted great attention, and many
models have been proposed. To evaluate the effectiveness of these models,
previous studies mainly rely on the simulated Amazon recommendation dataset,
which contains automatically generated queries and excludes cold users and tail
products. We argue that evaluating with such a dataset may yield unreliable
results and conclusions that deviate from real user satisfaction. To overcome
these problems, in this paper, we release a personalized product search dataset
comprised of real user queries and diverse user-product interaction types
(clicking, adding to cart, following, and purchasing) collected from JD.com, a
popular Chinese online shopping platform. More specifically, we sample about
170,000 users who were active on a specific date, then record all of their
interacted products and issued queries over one year, without removing any tail
users or products. This yields roughly 12,000,000 products, 9,400,000 real
searches, and 26,000,000 user-product interactions. We study the
characteristics of this dataset from various perspectives and evaluate
representative personalization models to verify its feasibility. The dataset
can be publicly accessed at GitHub: https://github.com/rucliujn/JDsearch.
Comment: Accepted to SIGIR 202
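The four interaction types above suggest a simple per-user log structure. A minimal sketch of working with such a log, assuming a hypothetical (user, product, interaction_type) tuple layout that is illustrative only and not the dataset's actual schema:

```python
# Hypothetical sketch: counting a single user's interactions by type in a
# JDsearch-style log. Field names and record layout are assumptions, not
# the released dataset's actual format.
from collections import Counter

# Illustrative records: (user_id, product_id, interaction_type)
interactions = [
    ("u1", "p10", "click"),
    ("u1", "p10", "add_to_cart"),
    ("u1", "p10", "purchase"),
    ("u2", "p11", "click"),
    ("u1", "p12", "follow"),
]

def interaction_counts(records, user_id):
    """Count each interaction type for one user."""
    return Counter(t for u, _, t in records if u == user_id)

counts = interaction_counts(interactions, "u1")
```

Keeping every user and product (rather than filtering tails) means such per-user counts can legitimately be very small, which is exactly the cold-user regime the dataset aims to preserve.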
RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit
Although Large Language Models (LLMs) have demonstrated extraordinary
capabilities in many domains, they still have a tendency to hallucinate and
generate fictitious responses to user requests. This problem can be alleviated
by augmenting LLMs with information retrieval (IR) systems (also known as
retrieval-augmented LLMs). Applying this strategy, LLMs can generate more
factual texts in response to user input according to the relevant content
retrieved by IR systems from external corpora as references. In addition, by
incorporating external knowledge, retrieval-augmented LLMs can answer in-domain
questions that cannot be answered by solely relying on the world knowledge
stored in parameters. To support research in this area and facilitate the
development of retrieval-augmented LLM systems, we develop RETA-LLM, a
RETrieval-Augmented LLM toolkit. In RETA-LLM, we create a complete pipeline
to help researchers and users build their customized in-domain LLM-based
systems. Compared with previous retrieval-augmented LLM systems, RETA-LLM
provides more plug-and-play modules to support better interaction between IR
systems and LLMs, including request rewriting, document retrieval, passage
extraction, answer generation, and fact-checking modules. Our toolkit is
publicly available at https://github.com/RUC-GSAI/YuLan-IR/tree/main/RETA-LLM.
Comment: Technical Report for RETA-LL
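The five module slots described above form a linear pipeline. The following is a toy sketch of that structure only; every function body (term-overlap retrieval, string-stitching generation) is a stand-in and none of this is RETA-LLM's actual API:

```python
# Sketch of a retrieval-augmented pipeline with the five module slots
# named in the abstract: request rewriting, document retrieval, passage
# extraction, answer generation, fact checking. All implementations are
# illustrative stand-ins.

def rewrite_request(query, history):
    # Stand-in rewriter: prepend the last turn so the request is self-contained.
    return f"{history[-1]} {query}" if history else query

def retrieve_documents(query, corpus, k=2):
    # Stand-in retriever: rank documents by term overlap with the query.
    terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda d: -len(terms & set(d.lower().split())))
    return ranked[:k]

def extract_passages(query, docs):
    # Stand-in extractor: keep sentences sharing at least one query term.
    terms = set(query.lower().split())
    return [s for d in docs for s in d.split(". ")
            if terms & set(s.lower().split())]

def generate_answer(query, passages):
    # Stand-in for the LLM call: stitch the retrieved evidence together.
    return " ".join(passages) if passages else "No supporting evidence found."

def fact_check(answer, passages):
    # Stand-in checker: flag answers unsupported by any retrieved passage.
    return any(p in answer for p in passages)

corpus = [
    "RETA-LLM is a retrieval-augmented toolkit. It pairs IR systems with LLMs.",
    "Tail products are rarely purchased.",
]
rewritten = rewrite_request("what is RETA-LLM", history=[])
docs = retrieve_documents(rewritten, corpus)
passages = extract_passages(rewritten, docs)
answer = generate_answer(rewritten, passages)
supported = fact_check(answer, passages)
```

The plug-and-play design the abstract describes corresponds to each stage being independently swappable: a different retriever or fact checker can replace its stand-in without touching the rest of the pipeline.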
Large Language Models for Information Retrieval: A Survey
As a primary means of information acquisition, information retrieval (IR)
systems, such as search engines, have integrated themselves into our daily
lives. These systems also serve as components of dialogue, question-answering,
and recommender systems. The trajectory of IR has evolved dynamically from its
origins in term-based methods to its integration with advanced neural models.
While neural models excel at capturing complex contextual signals and
semantic nuances, thereby reshaping the IR landscape, they still face
challenges such as data scarcity, interpretability, and the generation of
contextually plausible yet potentially inaccurate responses. This evolution
requires a combination of both traditional methods (such as term-based sparse
retrieval methods with rapid response) and modern neural architectures (such as
language models with powerful language understanding capacity). Meanwhile, the
emergence of large language models (LLMs), typified by ChatGPT and GPT-4, has
revolutionized natural language processing due to their remarkable language
understanding, generation, generalization, and reasoning abilities.
Consequently, recent research has sought to leverage LLMs to improve IR
systems. Given the rapid evolution of this research trajectory, it is necessary
to consolidate existing methodologies and provide nuanced insights through a
comprehensive overview. In this survey, we delve into the confluence of LLMs
and IR systems, including crucial aspects such as query rewriters, retrievers,
rerankers, and readers. Additionally, we explore promising directions within
this expanding field.
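The combination the survey calls for, fast term-based sparse retrieval alongside neural scoring, can be illustrated with a toy hybrid ranker. The "encoder" below is a character-frequency stand-in, not a real neural model, and the interpolation weight is an arbitrary choice:

```python
# Toy hybrid ranking: interpolate a term-based sparse score with a
# (stand-in) dense similarity, as in sparse+neural retrieval pipelines.
import math
from collections import Counter

def sparse_score(query, doc):
    # Term-overlap count, standing in for BM25/TF-IDF.
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    return sum(min(q[t], d[t]) for t in q)

def embed(text):
    # Stand-in "encoder": normalized character-frequency vector.
    c = Counter(text.lower())
    norm = math.sqrt(sum(v * v for v in c.values()))
    return {ch: v / norm for ch, v in c.items()}

def dense_score(query, doc):
    # Cosine-style similarity between the stand-in embeddings.
    qv, dv = embed(query), embed(doc)
    return sum(w * dv.get(ch, 0.0) for ch, w in qv.items())

def hybrid_rank(query, docs, alpha=0.5):
    # Linear interpolation of sparse and dense scores, best first.
    scored = [(alpha * sparse_score(query, d)
               + (1 - alpha) * dense_score(query, d), d) for d in docs]
    return [d for _, d in sorted(scored, reverse=True)]

ranked = hybrid_rank("sparse retrieval", [
    "sparse retrieval is fast",
    "dense retrieval captures semantics",
    "a poem about cats",
])
```

In the query-rewriter/retriever/reranker/reader decomposition the survey uses, a hybrid scorer like this would sit in the retriever or reranker slot, with an LLM acting as the reader over the top-ranked documents.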